Since the publication of the revolutionary draft Neanderthal genome paper https://science.sciencemag.org/content/328/5979/710, a few studies have been carried out to systematically establish maps of regions of Neanderthal introgression into modern humans. Those regions were hypothesized to have multiple functional effects related to skin color, male fertility etc.
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'neanderthal_dna.png', width=2000)
The currently dominating hypothesis of human evolution is that anatomically modern humans originated in Africa and migrated to Europe and Asia approximately 50 000 years ago. In Europe and Asia they met Neanderthals and Denisovans; how the latter happened to come to Europe and Asia is not clear. We know that modern humans interbred with Neanderthals and Denisovans and had common offspring. Modern humans of non-African ancestry carry an estimated 2-5% of Neanderthal DNA. This became known in 2010 when the group of Svante Pääbo sequenced the draft Neanderthal genome.
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'BriefHistory.png', width=2000)
Since the sequencing of the draft Neanderthal genome, there have been a few studies that systematically attempted to detect regions of Neanderthal introgression in modern human genomes, mostly using individuals from the 1000G project:
All the methods for detecting regions of Neanderthal introgression are based on comparing a test modern genome of European or Asian ancestry with the high-coverage Neanderthal and Denisovan genomes on the one hand, and with a sub-Saharan African (Yoruba) genome on the other hand. The candidate regions should be as similar as possible to the Neanderthal / Denisovan DNA and as divergent as possible from the African genome. Usually a test statistic such as a Conditional Random Field (CRF), S* or a Hidden Markov Model (HMM) is computed in a sliding window across the whole test genome. The disadvantage of those methods is their memoryless nature, i.e. no memory about the nucleotide sequence is kept in the model. Here we will develop a Deep Learning based method for detecting regions of Neanderthal introgression in modern human genomes. The advantage of this approach is its long memory of the nucleotide sequence, which may bring up new interesting candidate regions compared to the previous methods.
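To make the sliding-window idea concrete, here is a toy match-fraction score. This is purely illustrative and is not the actual CRF, S* or HMM statistic; the function names, sequences and window sizes are made up. Each window is scored by its similarity to the archaic reference minus its similarity to the African reference, so high positive values flag candidate introgressed windows.

```python
def window_match_fraction(test, ref, start, size):
    """Fraction of positions in test[start:start+size] identical to ref."""
    pairs = zip(test[start:start + size], ref[start:start + size])
    return sum(a == b for a, b in pairs) / size

def introgression_score(test, archaic, african, size=10, step=5):
    """For each sliding window, similarity to the archaic reference minus
    similarity to the African reference. High positive values suggest a
    candidate introgressed window."""
    scores = []
    for start in range(0, len(test) - size + 1, step):
        scores.append(window_match_fraction(test, archaic, start, size)
                      - window_match_fraction(test, african, start, size))
    return scores
```

Real methods additionally model linkage, divergence times and haplotype structure, which is exactly the context a memoryless window statistic cannot capture.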
Before diving into Deep Learning models for calling regions of Neanderthal introgression in modern genomes, let us perform a simpler experiment and try to classify sequences as belonging to either gene or intergenic regions. Since both Reich and Akey did their 1000G Neanderthal introgression calls using the hg19 version of the human reference genome, we downloaded hg19 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ and prepared it for fast sequence extraction with samtools faidx. Next, we are going to build an annotation file for protein-coding genes. We downloaded the RefSeq annotation file for hg19 from http://genome.ucsc.edu/cgi-bin/hgTables as a text-file and used the "genePredToGtf" tool to build the refGene_hg19.gtf gtf-annotation file. The gtf-annotation looks messy: it includes both gene and exon annotation, so we will select only the gene annotation, marked with the keyword "transcript" in the third column.
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes
awk -v OFS='\t' '{if($3=="transcript") print $1,$4,$5,$10}' refGene_hg19.gtf | \
tr -d '\"' | sed 's/\;//g' | sort | uniq > gene_coords.txt
echo
head gene_coords.txt
echo
wc -l gene_coords.txt
We can see that we have 40 996 RefSeq genes, which sounds like a lot given that the human genome has approximately 20 000 protein-coding genes. This discrepancy is explained by the non-uniqueness of the gene symbols in the last column. You can see that there are at least two HPS1 genes with different coordinates. This is a known fact, and it is why the unique Ensembl gene IDs are often preferred: one gene symbol can correspond to multiple Ensembl IDs.
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes
cut -f4 gene_coords.txt | sort | uniq | wc -l
If we check how many unique gene symbols we have, we get 27 565 genes, which is closer to the number of protein-coding genes in the human genome. Further, if we exclude non-coding RNA and LOC-genes, we get almost the correct number of protein-coding genes.
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes
cut -f4 gene_coords.txt | sort | uniq | grep -v "LOC" | grep -v "MIR" | grep -v "LINC" | wc -l
For now we are going to keep all 40 996 genes, assuming that each has a unique Ensembl ID even though some might be duplicates in the sense of gene symbol. Now we are going to read the gene coordinates into Python and plot the distribution of gene lengths.
import pandas as pd
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', header=None, sep="\t")
gene_coords.sort_values(by = [0, 1, 2], inplace = True)
gene_coords.to_csv(Path + 'gene_coords.txt', index = False, header = False, sep = '\t')
gene_coords.head()
gene_coords.shape
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
gene_lengths = gene_coords.iloc[:, 2]-gene_coords.iloc[:, 1]
sns.distplot(gene_lengths)
plt.title("Distribution of Gene Lengths", fontsize = 20)
plt.xlabel("Lengths of Genes", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
from scipy import stats
print(stats.describe(gene_lengths))
We can see that the lengths of genes vary from a minimal length of 19 bp up to 2.3 Mbp, with a mean length of 52 kbp. These numbers are good to keep in mind for further downstream analysis. Now we are going to use the coordinates of the genes and the human reference genome hg19 fasta-file, which we downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/, in order to extract the sequences of the genes using samtools. One can of course do it in Python, but samtools is much faster, so we will use samtools here, but run it from Python.
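For illustration only, a minimal pure-Python region extractor might look like the sketch below (the helper names are made up). It works for small FASTA files, but reading the whole hg19 fasta into memory this way would be impractically slow, which is why samtools faidx is the better tool for the real extraction.

```python
def read_fasta(path):
    """Return a {name: sequence} dict for a (small) FASTA file."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith('>'):
                if name is not None:
                    seqs[name] = ''.join(chunks)
                # FASTA headers may carry a description; keep only the ID
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        seqs[name] = ''.join(chunks)
    return seqs

def extract_region(seqs, chrom, start, end):
    """1-based inclusive coordinates, like samtools faidx chr:start-end."""
    return seqs[chrom][start - 1:end]
```

Note the coordinate convention: samtools region strings are 1-based and inclusive, hence the `start - 1` when slicing a Python string.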
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
with open('hg19_gene_regions.fa', 'a') as fp:
for i in range(gene_coords.shape[0]):
coord = str(str(gene_coords.iloc[i, 0]) + ':'
+ str(gene_coords.iloc[i, 1]) + '-' + str(gene_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
Now we have a fasta-file with 40 996 nucleotide sequences of genes from the hg19 human genome. Since we want to learn DNA motifs that discriminate between genes and non-genes, we need to build another fasta-file with 40 996 sequences of non-gene (intergenic) regions, each of exactly the same length as the corresponding gene region. To do this, we will need to know the length of each chromosome in order to randomly draw intergenic regions. When indexing the hg19 reference genome, samtools created a fai-file which contains the lengths of the chromosomes in the second column; we will read and sort it:
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes = chr_sizes.drop([2, 3, 4], axis = 1)
#chr_sizes = chr_sizes[chr_sizes[0].isin(['chr' + str(i) for i in list(range(1,23)) + ['X', 'Y']])]
#sex_chr_sizes = chr_sizes[chr_sizes[0].str.match('|'.join(['chrX','chrY']))]
#chr_sizes = chr_sizes.drop(chr_sizes[chr_sizes[0] == 'chrX'].index, axis = 0)
#chr_sizes = chr_sizes.drop(chr_sizes[chr_sizes[0] == 'chrY'].index, axis = 0)
#chr_sizes['temp'] = chr_sizes[0].str.split('chr').str[1]
#chr_sizes['temp'] = chr_sizes['temp'].astype(int)
#chr_sizes = chr_sizes.sort_values('temp')
#chr_sizes = chr_sizes.drop('temp', axis = 1)
#chr_sizes = chr_sizes.append(sex_chr_sizes)
chr_sizes.head()
Now for each gene in the gene_coords DataFrame, we are going to randomly draw a region of the same length as the gene on the same chromosome and check whether this region overlaps with any gene (not only this one) on the same chromosome. If it does not, we will add this region to the notgene_coords DataFrame. If it does overlap, we repeat the random drawing of the region again and again until we succeed in selecting a truly intergenic region of the same length as the given gene.
import numpy as np
chr_list = []
start_list = []
end_list = []
gene_lengths = list(gene_coords.iloc[:, 2] - gene_coords.iloc[:, 1])
a = 0
for i in range(gene_coords.shape[0]):
chr_df = gene_coords[gene_coords[0].isin([gene_coords.iloc[i,0]])]
overlap = True
while overlap == True:
reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords.iloc[i,0]].iloc[:,1]))
reg_end = reg_start + gene_lengths[i]
for j in range(chr_df.shape[0]):
b1 = chr_df.iloc[j,1]
b2 = chr_df.iloc[j,2]
if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) or \
(b1 > reg_start and b1 < reg_end) or (b2 > reg_start and b2 < reg_end):
overlap = True
break
else:
overlap = False
chr_list.append(gene_coords.iloc[i,0])
start_list.append(reg_start)
end_list.append(reg_end)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' genes')
notgene_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
notgene_coords.to_csv("notgene_coords.txt", index = False, header = False, sep = "\t")
notgene_coords.head()
Let us now confirm that the gene and intergenic regions indeed do not overlap. We will use bedtools intersect for this purpose. This command returns the coordinates of intersections between two sets of genomic regions. If it does not return anything, this implies there are no intersections between the gene and intergenic regions, which is what we want.
!bedtools intersect -a gene_coords.txt -b notgene_coords.txt | wc -l
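The same zero-overlap property can also be spot-checked in pure Python with a brute-force pairwise comparison (a sketch using half-open interval logic; it is quadratic in the number of regions, so it is only suitable as a sanity check, not a bedtools replacement):

```python
def intervals_overlap(a_intervals, b_intervals):
    """Count overlapping pairs between two lists of (chrom, start, end).
    Two intervals on the same chromosome overlap iff each starts before
    the other one ends (half-open semantics, as bedtools uses for BED)."""
    hits = 0
    for ca, sa, ea in a_intervals:
        for cb, sb, eb in b_intervals:
            if ca == cb and sa < eb and sb < ea:
                hits += 1
    return hits
```

For our files one would load gene_coords.txt and notgene_coords.txt into such tuple lists and expect `intervals_overlap` to return 0, mirroring the empty bedtools output.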
We get zero intersections, meaning that we indeed managed to build a set of intergenic regions that do not overlap with the gene regions. Now we are going to use the set of intergenic regions and extract the sequences of the hg19 human reference genome corresponding to the intergenic coordinates. We will again use samtools for this purpose.
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
with open('hg19_notgene_regions.fa', 'a') as fp:
for i in range(notgene_coords.shape[0]):
coord = str(str(notgene_coords.iloc[i, 0]) + ':'
+ str(notgene_coords.iloc[i, 1]) + '-' + str(notgene_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
Everything looks good at first glance; however, taking a closer look we can see that we have quite a few sequences with N-nucleotides. This is a kind of missing data, since N denotes nucleotides that could not be identified by the sequencer. A fast way to see the presence of N-nucleotides is to grep them.
!grep -c N hg19_gene_regions.fa
!grep -c N hg19_notgene_regions.fa
However, we know that samtools splits sequences into multiple lines, so grep does not actually report the number of N-containing entries but the number of N-containing lines. In order to be more precise, we can quickly count with Biopython the number of entries in each fasta-file that contain at least one N-nucleotide.
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_gene_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
i = i + 1
print('Gene regions file contains ' + str(i) + ' entries with at least one N-nucleotide')
j = 0
for record in SeqIO.parse('hg19_notgene_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
j = j + 1
print('Intergenic regions file contains ' + str(j) + ' entries with at least one N-nucleotide')
In case the fractions of missing data are different between gene and intergenic regions, this might be treated as a signal by Convolutional Neural Networks (CNNs), and this is not what we are interested in. Therefore, for simplicity, we will remove all entries containing N-nucleotides in either the gene or the intergenic regions. For example, if a gene region contains at least one N-nucleotide, we will drop the region together with the corresponding intergenic region, even though the latter might not contain N-nucleotides. This is needed for a better correspondence between the gene and intergenic fasta-files.
from Bio import SeqIO
gene_file = 'hg19_gene_regions.fa'
notgene_file = 'hg19_notgene_regions.fa'
a = 0
i = 0
with open('hg19_gene_clean.fa', 'a') as gene_out, open('hg19_notgene_clean.fa', 'a') as notgene_out:
for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
upper_gene = gene.seq.upper()
upper_notgene = notgene.seq.upper()
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' entries')
if 'N' not in str(upper_gene) and 'N' not in str(upper_notgene):
gene.seq = upper_gene
SeqIO.write(gene, gene_out, 'fasta')
notgene.seq = upper_notgene
SeqIO.write(notgene, notgene_out, 'fasta')
i = i + 1
else:
continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Thus we have removed approximately 7000 entries, but still have plenty of sequences left to run CNNs for gene vs. intergenic region classification. Now let us quickly check using grep whether we indeed are free of N-containing sequences.
!grep -c N hg19_gene_clean.fa
!grep -c N hg19_notgene_clean.fa
Looks good! Now it is time to start building the input matrix of sequences to be fed into the CNN. Since I have limited memory on my laptop, I will not read all the sequences into memory but extract the first cut = 500 nucleotides from each sequence. The gene and intergenic sequences of length less than cut = 500 nucleotides will be ignored and not included into the CNN input matrix. We also have to make sure that all 4 nucleotides are present in each sequence, i.e. we will omit e.g. AT or CG repeat sequences.
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
gene_file = 'hg19_gene_clean.fa'
notgene_file = 'hg19_notgene_clean.fa'
a = 0
gene_seqs = []
notgene_seqs = []
for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
cut = 500
if len(str(gene.seq)) < cut or len(str(notgene.seq)) < cut:
continue
s_gene = str(gene.seq)[0:cut]
s_notgene = str(notgene.seq)[0:cut]
if s_gene.count('A')>0 and s_gene.count('C')>0 and s_gene.count('G')>0 and s_gene.count('T')>0 and \
s_notgene.count('A')>0 and s_notgene.count('C')>0 and s_notgene.count('G')>0 and s_notgene.count('T')>0:
gene_seqs.append(s_gene)
notgene_seqs.append(s_notgene)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' entries')
Next, we will concatenate the gene and intergenic sequences into a data set. Checking the length of this large list of sequences, we can see that we have a fair number of statistical observations to run Convolutional Neural Networks (CNNs).
sequences = gene_seqs + notgene_seqs
len(sequences)
Here we prepare a list of sequence labels to be used in the CNN. We denote gene sequences as 1 and intergenic regions as 0. The length of this list of labels is of course equal to the length of the list of sequences above.
import numpy as np
labels = list(np.ones(len(gene_seqs))) + list(np.zeros(len(notgene_seqs)))
len(labels)
Now we need to one-hot-encode both the sequences and the labels for a correct input to the CNN. We will use the scikit-learn classes LabelEncoder and OneHotEncoder. The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3. The OneHotEncoder converts an array of integers to a sparse matrix where each row corresponds to one possible value of each feature, i.e. only 0 and 1 are present in the matrix. Note that fitting the encoders separately on every sequence is only safe here because we have already ensured above that all four nucleotides are present in each sequence.
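As an aside, a fixed-alphabet encoder (a minimal sketch, offered as an alternative to the scikit-learn encoders used here) does not rely on every sequence containing all four bases, because the base-to-column mapping is declared explicitly:

```python
import numpy as np

# Hypothetical helper: an explicit base-to-column mapping, so the encoding
# of "A" is the same for every sequence regardless of its base composition.
BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(sequence):
    """Return a (len(sequence), 4) one-hot matrix for an A/C/G/T string."""
    mat = np.zeros((len(sequence), 4))
    for i, base in enumerate(sequence):
        mat[i, BASE_INDEX[base]] = 1.0
    return mat
```

With per-sequence LabelEncoder fitting, a sequence missing e.g. all G's would silently map T to column 2 instead of 3; a fixed mapping rules that failure mode out.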
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Next is the standard step of splitting the data set into training and test sub-sets; the latter will be used for the final evaluation of the model. Please note that the one-hot-encoded data set of sequences is a three-dimensional array, analogous to how images are encoded for image recognition problems. The first number in the shape is the number of statistical observations (sequences), the second is the dimensionality of the data (the length of the sequences, equal to the cut variable), and the third is the number of channels, i.e. the 4 possible nucleotides A, C, G and T. Overall the data looks like a 1D image; a proper 2D image would be represented by a four-dimensional array, where the second and third dimensions would correspond to the height and width (in pixels) of the 2D images, and the fourth dimension would be the number of color channels, e.g. R, G and B.
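The shape analogy can be illustrated with dummy NumPy arrays (the sample count and image size below are arbitrary examples, not our actual data dimensions):

```python
import numpy as np

# A batch of 100 one-hot sequences of length 500 over 4 nucleotide channels
seq_batch = np.zeros((100, 500, 4))     # (samples, sequence length, channels)
# A batch of 100 RGB images of 32 x 32 pixels, for comparison
img_batch = np.zeros((100, 32, 32, 3))  # (samples, height, width, channels)

print(seq_batch.ndim, img_batch.ndim)   # sequences are 3D, images are 4D
```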
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
input_features, input_labels, test_size = 0.2, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
Finally it is time to construct a simple shallow Convolutional Neural Network (CNN) and start training it. We construct quite a shallow one-block VGG-like CNN, i.e. two consecutive 1D convolutional layers followed by a 1D max-pooling layer. Light dropout regularization is used to prevent overfitting.
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform',
input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(8, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.4))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
The training curves behave fairly normally, and we reach quite a high classification accuracy of 85%. Slight overfitting is present after approximately 20 epochs of training, however nothing very dangerous: the training and validation curves are still in close proximity to each other.
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1),
np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], '.2f'),
horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
The confusion matrix and final evaluation on the test data set demonstrate a good sequence classification accuracy of 85%. At some point it would be interesting to have a look at the misclassified sequences in order to understand what is wrong with them and why the CNN failed to classify them correctly. For each sequence we can now construct a so-called Saliency Map, i.e. the importance of each nucleotide for sequence classification. In this way we can see whether the CNN learnt certain patterns as most important for gene vs. intergenic sequence classification.
import keras.backend as K
def compute_salient_bases(model, x):
input_tensors = [model.input]
gradients = model.optimizer.get_gradients(model.output[0][1], model.input)
compute_gradients = K.function(inputs = input_tensors, outputs = gradients)
x_value = np.expand_dims(x, axis=0)
gradients = compute_gradients([x_value])[0][0]
sal = np.clip(np.sum(np.multiply(gradients,x), axis=1),a_min=0, a_max=None)
return sal
sequence_index = 12
K.set_learning_phase(1) #set learning phase
sal = compute_salient_bases(model, input_features[sequence_index])
plt.figure(figsize=[16,5])
zoom = len(sal)
barlist = plt.bar(np.arange(len(sal[0:zoom])), sal[0:zoom])
plt.xlabel('Bases')
plt.ylabel('Magnitude of saliency values')
plt.xticks(np.arange(len(sal[0:zoom])), list(sequences[sequence_index][0:zoom]), size = 6);
plt.title('Saliency map for bases in one of the sequences');
plt.show()
Now we are going to apply a similar methodology to the one used for gene vs. intergenic sequence classification, but this time to classify regions of Neanderthal introgression within the modern human genome vs. regions depleted of Neanderthal ancestry. We are going to train our CNN model on Neanderthal introgressed haplotypes identified using the S*-statistic in the study of Vernot and Akey, Science 2016, on Europeans and Asians from the 1000 Genomes project, available here: https://drive.google.com/drive/folders/0B9Pc7_zItMCVWUp6bWtXc2xJVkk. We used the coordinates of introgressed haplotypes from the file introgressed_haplotypes/LL.callsetEUR.mr_0.99.neand_calls_by_hap.bed.merged.by_chr.bed and selected only unique coordinates, so we ended up with 83 601 regions of Neanderthal introgression in modern Europeans. Let us read these coordinates and have a look at the length distribution of the Neanderthal introgressed regions.
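The "unique coordinates" selection described above can be sketched with pandas (a hypothetical helper; the commented usage assumes the header-less BED file named in the text):

```python
import pandas as pd

def unique_bed(df):
    """Drop duplicate (chrom, start, end) rows from a header-less BED
    DataFrame and sort it by coordinate."""
    return (df.drop_duplicates(subset=[0, 1, 2])
              .sort_values(by=[0, 1, 2])
              .reset_index(drop=True))

# Usage on the merged calls file mentioned above (path as in the text):
# calls = pd.read_csv('LL.callsetEUR.mr_0.99.neand_calls_by_hap.bed.merged.by_chr.bed',
#                     header=None, sep='\t')
# unique_bed(calls.iloc[:, :3]).to_csv('Akey_intr_coords.bed',
#                                      header=False, index=False, sep='\t')
```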
import os
import pandas as pd
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_coords = pd.read_csv('Akey_intr_coords.bed', header = None, sep = "\t")
intr_coords.head()
intr_coords.shape
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
sns.distplot(intr_lengths)
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
from scipy import stats
print(stats.describe(intr_lengths))
We can see that the regions of Neanderthal introgression are much longer than the genes. Here the shortest introgressed haplotype has a length of 10 kb, with the median length around 100 kb. Let us now, as previously, use samtools to extract the sequences of the Neanderthal introgressed regions from the hg19 build of the human reference genome.
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
a = 0
with open('hg19_intr_regions.fa', 'a') as fp:
for i in range(intr_coords.shape[0]):
coord = str(str(intr_coords.iloc[i, 0]) + ':'
+ str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
Next we will need to build a fasta-file with sequences of the same lengths but located outside of the Neanderthal introgressed regions. For this purpose we need the lengths of the human chromosomes, which we can easily extract from the fai-file generated when indexing the hg19 human reference genome.
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes = chr_sizes.drop([2, 3, 4], axis = 1)
chr_sizes.head()
Now for each introgressed region in the intr_coords DataFrame, we are going to randomly draw a region of the same length on the same chromosome and check whether this region overlaps with any introgressed region (not only this one) on the same chromosome. If it does not, we will add this region to the depl_coords DataFrame. If it does overlap, we repeat the random drawing of the region again and again until we succeed in selecting a region of the same length as the given introgressed region that is truly depleted of Neanderthal ancestry.
import numpy as np
chr_list = []
start_list = []
end_list = []
intr_lengths = list(intr_coords.iloc[:, 2] - intr_coords.iloc[:, 1])
a = 0
for i in range(intr_coords.shape[0]):
chr_df = intr_coords[intr_coords[0].isin([intr_coords.iloc[i,0]])]
overlap = True
while overlap == True:
reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords.iloc[i,0]].iloc[:,1]))
reg_end = reg_start + intr_lengths[i]
for j in range(chr_df.shape[0]):
b1 = chr_df.iloc[j,1]
b2 = chr_df.iloc[j,2]
if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) or \
(b1 > reg_start and b1 < reg_end) or (b2 > reg_start and b2 < reg_end):
overlap = True
break
else:
overlap = False
chr_list.append(intr_coords.iloc[i,0])
start_list.append(reg_start)
end_list.append(reg_end)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.to_csv("Akey_depl_coords.bed", index = False, header = False, sep = "\t")
depl_coords.head()
Let us make sure that we indeed managed to construct a set of regions of equal size to the Neanderthal introgressed regions but not overlapping with the latter. For this purpose, we will use bedtools intersect and feed it the introgression and depletion coordinates; the tool should not find any overlapping intervals.
!bedtools intersect -a Akey_intr_coords.bed -b Akey_depl_coords.bed | wc -l
Excellent, the two sets of coordinates do not overlap. We can continue by using the depletion coordinates to extract the actual sequences from the human reference genome hg19 using samtools.
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
a = 0
with open('hg19_depl_regions.fa', 'a') as fp:
for i in range(depl_coords.shape[0]):
coord = str(str(depl_coords.iloc[i, 0]) + ':'
+ str(depl_coords.iloc[i, 1]) + '-' + str(depl_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' Neanderthal ancestry depleted regions')
Everything looks good at first glance; however, taking a closer look we can see that we have quite a few sequences with N-nucleotides. This is a kind of missing data, since N denotes nucleotides that could not be identified by the sequencer. A fast way to see the presence of N-nucleotides is to grep them.
!grep -c N hg19_intr_regions.fa
!grep -c N hg19_depl_regions.fa
However, we know that samtools splits sequences into multiple lines, so grep does not actually report the number of N-containing entries but the number of N-containing lines. In order to be more precise, we can quickly count with Biopython the number of entries in each fasta-file that contain at least one N-nucleotide.
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_intr_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
i = i + 1
print('Introgressed regions file contains ' + str(i) + ' entries with at least one N-nucleotide')
j = 0
for record in SeqIO.parse('hg19_depl_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
j = j + 1
print('Depleted regions file contains ' + str(j) + ' entries with at least one N-nucleotide')
In case the fractions of missing data are different between the Neanderthal introgressed and depleted regions, this might be treated as a signal by Convolutional Neural Networks (CNNs), and this is not what we are interested in. Therefore, for simplicity, we will remove all entries containing N-nucleotides in either the Neanderthal introgressed or the depleted regions. For example, if a Neanderthal introgressed region contains at least one N-nucleotide, we will drop the region together with the corresponding Neanderthal-depleted region, even though the latter might not contain N-nucleotides. This is needed for a better correspondence between the introgressed and depleted fasta-files.
from Bio import SeqIO
intr_file = 'hg19_intr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0
i = 0
with open('hg19_intr_clean.fa', 'a') as intr_out, open('hg19_depl_clean.fa', 'a') as depl_out:
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
upper_intr = intr.seq.upper()
upper_depl = depl.seq.upper()
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' entries')
if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
intr.seq = upper_intr
SeqIO.write(intr, intr_out, 'fasta')
depl.seq = upper_depl
SeqIO.write(depl, depl_out, 'fasta')
i = i + 1
else:
continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Thus we have removed approximately 10 000 entries, but still have plenty of sequences left to run CNNs for Neanderthal introgressed vs. depleted region classification. Now let us quickly check using grep whether we indeed are free of N-containing sequences.
!grep -c N hg19_intr_clean.fa
!grep -c N hg19_depl_clean.fa
Excellent! Now it is time to start building the input matrix of sequences to be fed into the CNN. Since I have limited memory on my laptop, I will not read all the sequences into memory but extract the first cut = 1000 nucleotides from each sequence. The Neanderthal introgressed and depleted sequences of length less than cut = 1000 nucleotides will be ignored and not included into the CNN input matrix. We also have to make sure that all 4 nucleotides are present in each sequence, i.e. we will omit e.g. AT or CG repeat sequences.
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cut = 1000
    if len(str(intr.seq)) < cut or len(str(depl.seq)) < cut:
        continue
    s_intr = str(intr.seq)[0:cut]
    s_depl = str(depl.seq)[0:cut]
    if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
       s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
    a = a + 1
    if a % 10000 == 0:
        print('Finished ' + str(a) + ' entries')
sequences = intr_seqs + depl_seqs
len(sequences)
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
input_features, input_labels, test_size = 0.2, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
from keras.models import Sequential
from keras.regularizers import l2, l1
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta, RMSprop
from keras.layers import Conv1D, Conv2D, Dense, MaxPooling1D, MaxPooling2D, Flatten, Dropout, Activation
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same',
input_shape = (train_features.shape[1], train_features.shape[2])))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
#model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(1000, kernel_regularizer = l1(0.00001)))
model.add(Activation("sigmoid"))
#model.add(Dropout(0.1))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.001
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
#sgd = SGD(lr = lrate, momentum = 0.9, nesterov = False)
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_binary_accuracy', verbose = 1,
save_best_only = True, mode = 'max')
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
callbacks = [checkpoint])
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
We can see that the model starts overfitting almost immediately, and it is hard to regularize: adding heavier Dropout prevents the model from starting to learn at all. It is hard to say why we observe this behavior. The number of training examples is almost 100 000, which seems like a lot, although it is still not enough given how many parameters the CNN uses. What we definitely observe is that increasing the segment length cut provides higher and higher classification accuracy. Here I am restricted by the memory of my laptop and cannot increase the segment length further; in addition, that would lead to an even larger parameter space, i.e. even heavier overfitting. Perhaps more examples would help: for example, chopping each Neanderthal introgression segment into multiple short ~100 bp segments would increase the training set without increasing the parameter space.
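As a rough illustration of the last idea, one could chop each cleaned sequence into non-overlapping windows; the function name chop_sequence and the 100 bp window size are my own choices for this sketch, not something used elsewhere in this notebook:

```python
def chop_sequence(seq, window = 100):
    """Split a sequence into non-overlapping windows of fixed length,
    discarding the incomplete tail; every window would inherit the label
    (introgressed / depleted) of its parent region."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]

# each 1000 bp training example becomes ten 100 bp examples
windows = chop_sequence('ACGT' * 250, window = 100)
print(len(windows), len(windows[0]))  # 10 100
```

This would multiply the number of training examples roughly tenfold while the CNN input (and hence the parameter count of the Flatten/Dense part) shrinks.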
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1),
np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color = 'white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.0f%%" % (scores[1]*100))
We reach quite good accuracy of 71% on the test data set. This is not fantastic, but it can perhaps be improved by a more clever selection of the regions of Neanderthal introgression, e.g. by applying RepeatMasker or a similar tool. Again, similar to the gene vs. intergenic region classification, for each sequence we can now construct a so-called Saliency Map, i.e. the importance of each nucleotide for the classification of the sequence. In this way we can see which patterns the CNN learnt to consider most important for classifying sequences as Neanderthal introgressed vs. depleted.
import keras.backend as K
def compute_salient_bases(model, x):
    input_tensors = [model.input]
    gradients = model.optimizer.get_gradients(model.output[0][1], model.input)
    compute_gradients = K.function(inputs = input_tensors, outputs = gradients)
    x_value = np.expand_dims(x, axis = 0)
    gradients = compute_gradients([x_value])[0][0]
    sal = np.clip(np.sum(np.multiply(gradients, x), axis = 1), a_min = 0, a_max = None)
    return sal
sequence_index = 5
K.set_learning_phase(1) #set learning phase
sal = compute_salient_bases(model, input_features[sequence_index])
plt.figure(figsize=[16,5])
zoom = len(sal)
barlist = plt.bar(np.arange(len(sal[0:zoom])), sal[0:zoom])
plt.xlabel('Bases')
plt.ylabel('Magnitude of saliency values')
plt.xticks(np.arange(len(sal[0:zoom])), list(sequences[sequence_index][0:zoom]), size = 6);
plt.title('Saliency map for bases in one of the sequences');
plt.show()
Here we will try to convert our sequences into 2D images and run a 2D Convolutional Neural Network (CNN) as if we were solving a pattern-recognition task for image classification. The resulting numpy array will be 4-dimensional: the first dimension is the number of sequences / statistical observations / training examples, the second and third dimensions are the 2D image size (in pixels, where one pixel corresponds to one nucleotide), and the fourth dimension is the probability of observing one of the 4 nucleotides (A, C, T or G); it can also be seen as the analogue of the 4 channels in CMYK color coding. Because of the 4D structure of the input data set, I need more memory than I have on my laptop. Currently I can input 1000 bp stretches of DNA (with subsequent one-hot-encoding) into the 1D CNN, or alternatively 32 bp stretches into the 2D CNN, since $32 \times 32 \approx 1000$.
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cut = 32
    if len(str(intr.seq)) < cut or len(str(depl.seq)) < cut:
        continue
    s_intr = str(intr.seq)[0:cut]
    s_depl = str(depl.seq)[0:cut]
    if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
       s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
    a = a + 1
    if a % 10000 == 0:
        print('Finished ' + str(a) + ' entries')
sequences = intr_seqs + depl_seqs
len(sequences)
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
input_features.shape
B = np.array(np.repeat(input_features[:, :, np.newaxis, :], repeats = input_features.shape[1], axis = 2))
B.shape
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
B, input_labels, test_size = 0.2, random_state = 42)
from keras.models import Sequential
from keras.regularizers import l2, l1
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta, RMSprop
from keras.layers import Conv1D, Conv2D, Dense, MaxPooling1D, MaxPooling2D, Flatten, Dropout, Activation
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv2D(filters = 32, kernel_size = (5, 5), padding = 'same',
input_shape = (train_features.shape[1], train_features.shape[2], train_features.shape[3])))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2, 2)))
#model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(10)) # kernel_regularizer = l1(0.00001)
model.add(Activation("sigmoid"))
#model.add(Dropout(0.1))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.001
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
#sgd = SGD(lr = lrate, momentum = 0.9, nesterov = False)
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_binary_accuracy', verbose = 1,
save_best_only = True, mode = 'max')
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
callbacks = [checkpoint])
Here we are going to ask a simple question: do human genes from the hg19 reference genome predominantly overlap with Neanderthal introgressed or depleted regions? To answer this question we are going to do heavy randomizations. First, we will use bedtools to calculate the number of intersects between the Neanderthal introgressed regions and the genes. Second, we will randomly draw Neanderthal depleted regions of the same lengths as the introgressed regions a number of times and calculate the intersects with the genes for each draw. In this way we can plot the distribution of intersects between the depleted regions and the genes and show whether the genes predominantly overlap with the Neanderthal introgressed or depleted regions. Let us start with reading the coordinates of the introgressed regions:
import os
import pandas as pd
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_coords = pd.read_csv('Akey_intr_coords.bed', header = None, sep = "\t")
intr_coords.head()
Now let us check the number of intersects between the introgressed regions and the genes from hg19 human reference genome:
%%bash
Path=/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes
bedtools intersect -a Akey_intr_coords.bed -b $Path/gene_coords.txt | wc -l
And for comparison let us check the number of intersects between the depleted regions and the genes from the hg19 human reference genome:
%%bash
Path=/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes
bedtools intersect -a Akey_depl_coords.bed -b $Path/gene_coords.txt | wc -l
We can see that genes intersect with depleted regions more often than with the introgressed ones. However, what if we just happened by chance to select "good" depleted regions that overlap more often with the genes? To make a robust statement about the overlap of introgressed vs. depleted regions with genes we need to randomly draw depleted regions at least a few more times. To do that we again need to know the lengths of the chromosomes from hg19.
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes = chr_sizes.drop([2, 3, 4], axis = 1)
chr_sizes.head()
Now we repeat the drawing of depleted regions a few times. This procedure takes quite a lot of time due to my non-optimal code implementation, so I will repeat it ~10-20 times, which should be enough to demonstrate that the number of intersects between depleted regions and genes is consistently larger than between the introgressed regions and genes.
import os
import subprocess
import numpy as np
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
perm_n = []
for k in range(5):
    chr_list = []
    start_list = []
    end_list = []
    intr_lengths = list(intr_coords.iloc[:, 2] - intr_coords.iloc[:, 1])
    a = 0
    for i in range(intr_coords.shape[0]):
        chr_df = intr_coords[intr_coords[0].isin([intr_coords.iloc[i, 0]])]
        overlap = True
        while overlap == True:
            reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords.iloc[i, 0]].iloc[:, 1]))
            reg_end = reg_start + intr_lengths[i]
            for j in range(chr_df.shape[0]):
                b1 = chr_df.iloc[j, 1]
                b2 = chr_df.iloc[j, 2]
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) or \
                   (b1 > reg_start and b1 < reg_end) or (b2 > reg_start and b2 < reg_end):
                    overlap = True
                    break
                else:
                    overlap = False
        chr_list.append(intr_coords.iloc[i, 0])
        start_list.append(reg_start)
        end_list.append(reg_end)
        a = a + 1
        if a % 20000 == 0:
            print('Finished ' + str(a) + ' Neanderthal haplotypes')
    depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
    depl_coords.to_csv("temp.txt", index = False, header = False, sep = "\t")
    genes_path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/'
    with open('n.txt', 'w') as fp:
        subprocess.run(['bedtools', 'intersect', '-a', 'temp.txt', '-b',
                        genes_path + 'gene_coords.txt'], stdout = fp)
    akey_n = pd.read_csv('n.txt', header = None, sep = "\t")
    print(k, akey_n.shape[0])
    print('**********************************************************')
    perm_n.append(akey_n.shape[0])
perm_n = [150438, 150340, 149772, 149798, 149664, 150748, 151154,
150818, 150498, 151426, 151244, 151267, 150060, 151156, 149792, 150381, 150075]
perm_n
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,12))
plt.axvline(x = 140821, linewidth = 4, color = 'r')
sns.distplot(perm_n)
plt.title("Distribution of Gene-Depletion Intersects: Vernot and Akey, Science 2016", fontsize = 30)
plt.xlabel("Number of Intersects Between Gene and Neanderthal Depleted Regions", fontsize = 30)
plt.ylabel("Frequency", fontsize = 30)
plt.show()
It is remarkable that none of the 17 draws of depleted regions gives a number of intersects with genes below the number of intersects between the introgressed regions and the genes. If we were to calculate a p-value for this enrichment it would be essentially zero; a more correct way to express it is p-value < 1/17 ≈ 0.06. This demonstrates that genes predominantly overlap with the regions of depleted Neanderthal ancestry.
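The empirical p-value logic above can be sketched in a few lines, using the observed 140821 intersects for the introgressed regions and the perm_n draws from the previous cell (the add-one correction is a common convention I adopt here, which is why the sketch gives 1/18 rather than 1/17):

```python
import numpy as np

perm_n = [150438, 150340, 149772, 149798, 149664, 150748, 151154,
          150818, 150498, 151426, 151244, 151267, 150060, 151156,
          149792, 150381, 150075]
observed = 140821  # intersects between introgressed regions and genes

# empirical p-value: fraction of random draws at or below the observed value;
# with zero such draws, the add-one correction avoids reporting exactly p = 0
n_below = int(np.sum(np.array(perm_n) <= observed))
p_value = (n_below + 1) / (len(perm_n) + 1)
print(n_below, round(p_value, 3))  # 0 0.056
```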
Now let us perform a simple exercise which will be very useful for us later. Here we are going to calculate the GC-content of each gene from its sequence. We have a fasta-file with the sequence of each gene extracted from the hg19 build of the human reference genome, so it will be relatively easy to go through those sequences one by one and count the percentage of C and G nucleotides. The goal of this exercise is to figure out which genes in the human genome are the most GC-rich and the most AT-rich; we will need this later.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
ids = []
seqs = []
GC_content = []
with open('hg19_gene_regions.fa', 'r') as fin:
    for line in fin:
        line = line.rstrip()
        if line[0] == '>' and len(seqs) == 0:
            ids.append(line)
        elif line[0] == '>' and len(seqs) > 0:
            my_seq = ''.join(seqs).upper()
            GC = round((my_seq.count('C') + my_seq.count('G')) / (len(my_seq) - my_seq.count('N')), 4) * 100
            GC_content.append(GC)
            seqs = []
            ids.append(line)
        else:
            seqs.append(line)
# flush the last record
my_seq = ''.join(seqs).upper()
GC = round((my_seq.count('C') + my_seq.count('G')) / (len(my_seq) - my_seq.count('N')), 4) * 100
GC_content.append(GC)
GC_df = pd.DataFrame({'Coord': ids, 'GCcontent': GC_content})
GC_df = GC_df.sort_values(by = ['GCcontent'], ascending = False)
print('Most GC-rich genes:')
print(GC_df.head())
print('\n')
print('Most AT-rich genes:')
print(GC_df.tail())
GC_df.to_csv('genes_GC_content.txt', index = False, header = True, sep = '\t')
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/
samtools faidx hg19.fa.gz chr4:126428414-126428462 | sed '1d' | tr -d '\n'
my_str = 'CTGTAATATAAATTTAATTTATTCTCTATCATTAAAAAATGTATTACAG'
(my_str.count('C') + my_str.count('G')) / len(my_str)
Ok, it would be nice to see the gene symbols. It is strange to see genes with 100% GC-content; those must be some non-coding RNAs.
import pandas as pd
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', header=None, sep="\t")
gene_coords['Coord'] = '>' + gene_coords[0].astype(str) + ':' \
+ gene_coords[1].astype(str) + '-' + gene_coords[2].astype(str)
gene_coords.head()
GC_df_gene = pd.merge(GC_df, gene_coords, on = 'Coord')
GC_df_gene = GC_df_gene[[3, 0, 1, 2, 'GCcontent']]
GC_df_gene.columns = ['Gene', 'Chr', 'Start', 'End', 'GCcontent']
GC_df_gene.head()
GC_df_gene.tail()
Yes, indeed, the micro-RNAs (the MIR-genes) seem to be extreme in their GC-content. Let us filter them out and display the most GC-rich and most AT-rich genes again without the MIR-genes, LINC-genes or SNOR-genes, which are small non-coding RNAs.
GC_df_gene_noMIR = GC_df_gene[~GC_df_gene['Gene'].str.contains('MIR')]
GC_df_gene_noMIR = GC_df_gene_noMIR[~GC_df_gene_noMIR['Gene'].str.contains('LINC')]
GC_df_gene_noMIR = GC_df_gene_noMIR[~GC_df_gene_noMIR['Gene'].str.contains('SNOR')]
GC_df_gene_noMIR.head(10)
GC_df_gene_noMIR.to_csv('genes_GC_content.txt', index = False, header = True, sep = '\t')
GC_df_gene_noMIR.tail(10)
It is very interesting that there are human genes (that are not small non-coding RNAs) with GC-content as high as ~80%, and genes with GC-content as low as ~30%. Let us display the distribution of GC-content across the genes.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
sns.distplot(GC_df_gene_noMIR['GCcontent'])
plt.title("Distribution of GC-content Across Genes from HG19 Human Reference Genome", fontsize = 20)
plt.xlabel("GC-content of Genes", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
We can see that the mode is indeed at around 41%, as expected; however, the distribution does not look very symmetric and seems to contain many more genes with GC-content above 41% than below it. This probably reflects the known observation that genes are in general GC-rich regions. If we calculate the mean GC-content of the gene regions, it is around 47%, which is much higher than the 41% across the whole human reference genome.
np.mean(GC_df_gene_noMIR['GCcontent'])
Another very interesting observation is the ANGPTL3 gene, which is very AT-rich with a GC-content of only ~30%. This gene has a very strong link to HDL cholesterol levels in human blood, which is one of the determinant factors in the genetics of Type 2 Diabetes (T2D) Mellitus. Indeed, if we display a piece of the ANGPTL3 gene we observe that it is an extremely GC-poor gene.
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/
samtools faidx hg19.fa.gz chr1:63063158-63071976 | sed '1d' | tr -d '\n' | cut -c1-500
Below we will show that AT-rich regions predominantly came to the human genome from Neanderthals. Particularly, the ANGPTL3 gene will be predicted to be a Neanderthal gene that was introgressed into the modern human genome ~2000 generations ago when humans migrated out of Africa and bred with Neanderthals in Europe and Asia.
Here we are going to use elements of Natural Language Processing (NLP) and K-mer analysis in order to find DNA motifs that differ between Neanderthal introgressed and depleted regions of the human genome. A DNA sequence is essentially a text, so we can apply the whole power of the NLP apparatus to DNA analysis. But where are the sentences and words in one large DNA sequence? It turns out that a sequence can be split into K-mers, which represent words / tokens, and those K-mers can be concatenated in a space-delimited manner so that we get a sentence at the end. Now we are all set to use NLP!
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'Kmers.png', width=2000)
Here we will start with the simplest Bag of Words model that simply counts frequencies of words / K-mers between two different classes of texts / sequences. In our case we are asking: do Neanderthal introgressed regions contain more frequent K-mers compared to the regions of depleted Neanderthal ancestry? To answer this question we are going to read the introgressed and depleted sequences and split them into K-mers:
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 500
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    #my_intr_reverse_compliment_seq = str(Seq(my_intr_seq).reverse_complement())
    #intr_seqs.append(my_intr_reverse_compliment_seq)
    depl_seqs.append(my_depl_seq)
    #my_depl_reverse_compliment_seq = str(Seq(my_depl_seq).reverse_complement())
    #depl_seqs.append(my_depl_reverse_compliment_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
Now that we have read the introgressed and depleted sequences into memory, we need to define a function that splits a given sequence into K-mers, and use it to preprocess the sequences by converting them into sentences of K-mers. Suppose the sequences in the lists above are different texts; it is natural to consider their k-mers as the words of those texts. Different sequences can share the same k-mers, indicating that the same words can be used in different texts. However, there are words / k-mers that are specific to certain texts / sequences, or whose counts can say something about the topic of the text / biology of the sequence. Here we are going to split each sequence into k-mers and construct sentences, which are lists of words / k-mers.
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
print('Building Neanderthal introgressed sequences')
intr_sentences = []
for i in range(len(intr_seqs)):
    intr_sentences.append(getKmers(intr_seqs[i], kmer))
print('Building Neanderthal depleted sequences')
depl_sentences = []
for i in range(len(depl_seqs)):
    depl_sentences.append(getKmers(depl_seqs[i], kmer))
The words / k-mers form a vocabulary, whose size we will determine later. We can use the Counter class for efficient counting of the words, as well as for displaying and making barplots of the most common words in the Neanderthal introgressed and depleted regions.
from collections import Counter
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20,18))
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
plt.subplot(2, 1, 1)
D = dict(Counter([item for sublist in intr_sentences for item in sublist]).most_common(20))
plt.bar(range(len(D)), list(D.values()), align='center')
plt.title('Most Common K-mers for Neanderthal Introgressed Regions', fontsize = 20)
plt.ylabel("Counts", fontsize = 20)
plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)
plt.subplot(2, 1, 2)
D = dict(Counter([item for sublist in depl_sentences for item in sublist]).most_common(20))
plt.bar(range(len(D)), list(D.values()), align='center')
plt.title('Most Common K-mers for Neanderthal Depleted Regions', fontsize = 20)
plt.ylabel("Counts", fontsize = 20)
plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)
plt.show()
We can see that A- and T-rich K-mers seem to be most common for both Neanderthal introgressed and depleted regions. However, there are small differences in their counts that might be indicative for different composition between them. We can also build the data frames of K-mer counts for both Neanderthal introgressed and depleted regions and display them:
import pandas as pd
intr_counts = dict(Counter([item for sublist in intr_sentences for item in sublist]))
kmers = list(intr_counts.keys())
counts = list(intr_counts.values())
intr_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
intr_df = intr_df.sort_values(['Count'], ascending = False)
intr_df.head(10)
import pandas as pd
depl_counts = dict(Counter([item for sublist in depl_sentences for item in sublist]))
kmers = list(depl_counts.keys())
counts = list(depl_counts.values())
depl_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
depl_df = depl_df.sort_values(['Count'], ascending = False)
depl_df.head(10)
Both data frames contain the same K-mer vocabulary in the Kmer column, but in a different order due to the different counts of the K-mers. Now we are going to merge the two data frames and calculate odds ratios between the counts for each K-mer. Then we will order the K-mers by the odds ratio between depleted and introgressed regions: the highest odds ratios show which K-mers are most discriminative between the two classes of genomic regions.
merge_df = pd.merge(intr_df, depl_df, on = 'Kmer')
merge_df.columns = ['Kmer','Count_Intr','Count_Depl']
merge_df['Odds_Depl2Intr'] = merge_df['Count_Depl'] / merge_df['Count_Intr']
sorted_merge_df = merge_df.sort_values(['Odds_Depl2Intr'], ascending = False)
sorted_merge_df.head()
sorted_merge_df.tail()
We can easily see that the most discriminative K-mers between Neanderthal depleted and introgressed regions, the ones overrepresented in the depleted regions, are very GC-rich. This gives us a hint that regions of depleted Neanderthal ancestry have something to do with high GC-content. We remember from the previous section that human genes are typically GC-rich (47% GC-content against 41% genome-wide). Here we conclude that regions of depleted Neanderthal ancestry fall predominantly within human genes. This confirms the result of the section about intersects between gene regions and Neanderthal introgressed / depleted regions.
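The GC-richness claim is easy to check directly on the K-mer strings themselves; here is a small sketch (the helper name gc_fraction is my own, and the two example K-mers are illustrative, taken to resemble the extremes of the odds-ratio table):

```python
def gc_fraction(kmer):
    """Fraction of G and C nucleotides in a K-mer."""
    return (kmer.count('G') + kmer.count('C')) / len(kmer)

# K-mers overrepresented in depleted regions should score high,
# A/T-rich K-mers typical of introgressed regions should score low
print(gc_fraction('GCCGC'))  # 1.0
print(gc_fraction('ATTAA'))  # 0.0
```

On the real data one could apply gc_fraction to the Kmer column of sorted_merge_df and confirm that the values are high at the head of the table and low at its tail.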
We can rank all the K-mers by the deviation of their odds ratios from 1. This demonstrates the feature importance, or predictive power, of each K-mer:
sorted_merge_df['PredictPower'] = abs(sorted_merge_df['Odds_Depl2Intr'] - 1)
sorted_merge_df.head()
sorted_merge_df.tail()
import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
sorted_merge_df.to_csv('BagOfWords_Neand_Intr_vs_Depleted.txt', header = True, index = False, sep = '\t')
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.bar(range(len(list(sorted_merge_df['Kmer']))),
list(sorted_merge_df['PredictPower']), align = 'center', width = 1)
plt.title('Predictive Power of K-mers', fontsize = 20)
plt.ylabel("Magnitude of Prediction", fontsize = 20)
plt.xticks(rotation = 90)
plt.xticks(list(range(len(list(sorted_merge_df['Kmer']))))[0::8],
list(sorted_merge_df['Kmer'])[0::8], fontsize = 8)
plt.show()
The y-axis of the barplot shows the amplitude of the deviation of the odds ratio from 1 for each K-mer. The larger the deviation of the odds ratio from 1, the more frequent the K-mer is in either the regions of depleted Neanderthal ancestry (left tail of the barplot) or the introgressed Neanderthal regions (right tail of the barplot). Here I display every 8th bar label for better visibility. Nevertheless, one can notice that the left tail of the barplot (K-mers frequent in the regions of depleted Neanderthal ancestry) contains mostly GC-rich K-mers. This implies that the regions of depleted Neanderthal ancestry are more likely to be human gene regions, as those are known to have a higher GC-content than the average genome-wide level.
Out of curiosity, let us check which K-mers are most frequent in the ANGPTL3 gene, which had a very low GC-content, i.e. was extremely AT-rich. First we read its sequence and split it into K-mers, then we use the Counter class to count the K-mers in the resulting list.
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/
samtools faidx hg19.fa.gz chr1:63063158-63071976 | sed '1d' | tr -d '\n' > ANGPTL3_sequence.txt
import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
with open('ANGPTL3_sequence.txt', 'r') as fin:
    ANGPTL3_seq = fin.read().upper()
import pandas as pd
ANGPTL3_counts = dict(Counter(getKmers(ANGPTL3_seq, 5)))
kmers = list(ANGPTL3_counts.keys())
counts = list(ANGPTL3_counts.values())
ANGPTL3_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
ANGPTL3_df = ANGPTL3_df.sort_values(['Count'], ascending = False)
ANGPTL3_df.head(10)
ANGPTL3_df.tail(10)
As we can see, the ANGPTL3 gene is indeed very AT-rich: the most frequent K-mers contain only A and T nucleotides. In contrast, GC-rich K-mers are not very common in the ANGPTL3 gene.
Now we are going to use the Bag of Words model, i.e. K-mer counting, to train a simple Random Forest classifier that should learn to classify sequences as belonging to Neanderthal introgressed or depleted regions. As previously, we start by reading the two fasta-files with the introgressed and depleted sequences into memory.
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 8800
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    #my_intr_reverse_complement_seq = str(Seq(my_intr_seq).reverse_complement())
    #intr_seqs.append(my_intr_reverse_complement_seq)
    depl_seqs.append(my_depl_seq)
    #my_depl_reverse_complement_seq = str(Seq(my_depl_seq).reverse_complement())
    #depl_seqs.append(my_depl_reverse_complement_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
Now we will use the getKmers function that splits the sequences into K-mers. We apply this function to each sequence and concatenate the K-mers on the fly in a space-delimited manner in order to build sentences out of sequences. The K-mers are the words in those sentences.
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
intr_texts[0][0:155]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
depl_texts[0][0:155]
Thus we have built two lists (intr_texts and depl_texts) of sentences, one for the introgressed sequences and the other for the depleted sequences. One can say that e.g. intr_texts is a text with many sentences, and those sentences consist of K-mers as words or tokens. The words are all of the same length and are space-delimited in each sentence. We are going to merge the two texts into a single big list / text of sentences, and create a list of labels for each sentence, where 1 corresponds to sentences coming from the Neanderthal-introgressed text and 0 corresponds to sentences coming from the Neanderthal-depleted text.
merge_texts = intr_texts + depl_texts
len(merge_texts)
import numpy as np
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
print(len(labels))
Now we are going to apply a simple count-based (one-hot-encoding-like) transform to each word in the vocabulary of size 4^kmer in order to convert the words / K-mers in each sentence into a numeric / computer-friendly form. Essentially we are building a matrix of counts for each word / K-mer, where the columns are the K-mers and the rows are the sentences / sequences split into K-mers. The elements of the matrix are how many times each word / K-mer is found in each sentence / sequence. This can be done via the scikit-learn class CountVectorizer. We will now have a look at what the transformed data looks like as well as the dimensions of the transformed data matrix:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
cv = CountVectorizer()
X = cv.fit_transform(merge_texts)
#tfidf_transformer = TfidfTransformer()
#X = tfidf_transformer.fit_transform(X)
#tokenizer = Tokenizer()
#tokenizer.fit_on_texts(merge_texts)
#encoded_docs = tokenizer.texts_to_sequences(merge_texts)
#max_length = max([len(s.split()) for s in merge_texts])
#X = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(X.toarray())
print('\n')
print(X.shape)
Now we have built the input matrix X and the list of labels to be fed into the classifier. Next, as is standard, we are going to randomly split the data set X into training, X_train, and testing, X_test, subsets. The same random split will simultaneously be done for the labels.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), labels, test_size=0.20, random_state=42)
print(X_train.shape)
print(X_test.shape)
Now we will train the Random Forest classifier and evaluate it on the test data set.
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#classifier = GaussianNB()
#classifier = LinearSVC()
#classifier = LogisticRegression()
classifier = RandomForestClassifier(n_estimators = 500)
#classifier = XGBClassifier(n_estimators = 100)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
print("Confusion matrix\n")
print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))
def get_metrics(y_test, y_predicted):
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1
accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print('\n')
print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))
import os
import pickle
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
pickle.dump(classifier, open('RF_model_Neand_Intr_vs_Depl.sav', 'wb'))
Let us now load the saved Random Forest classifier and display the confusion matrix and a final evaluation accuracy calculated on the test data set.
import os
import pickle
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
classifier = pickle.load(open('RF_model_Neand_Intr_vs_Depl.sav', 'rb'))
predicted_labels = classifier.predict(X_test)
predicted_labels
import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
plt.figure(figsize=(15,10))
cm = confusion_matrix(y_test, [np.round(i) for i in predicted_labels])
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
From the confusion matrix for the Random Forest classifier (above) it becomes clear that the model had a very high accuracy when predicting segments of depleted Neanderthal ancestry, but performed very poorly (worse than the neural network) when classifying introgressed regions.
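This asymmetry is just the per-class recall, which can be read directly off the confusion matrix; a quick numpy sketch with a hypothetical 2x2 matrix (rows = true labels, columns = predicted labels; the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical confusion matrix: row 0 = depleted (true), row 1 = introgressed (true)
cm = np.array([[900, 100],
               [300, 700]])

# Per-class recall: fraction of each true class that was predicted correctly
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)  # [0.9 0.7]
```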
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, predicted_labels)
print("Accuracy: %.2f%%" % (score*100))
classifier.feature_importances_
Surprisingly, we achieve very high accuracy with the simple Random Forest classifier using the Bag of Words (basically K-mer frequencies) NLP model. We can also display the feature importances, which tree-based ensembles such as the Random Forest readily provide:
names = cv.get_feature_names()
names = [i.upper() for i in names]
feature_import = sorted(zip(map(lambda x: str(round(x, 5)), classifier.feature_importances_), names),
reverse = True)
feature_import[0:20]
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
importances = classifier.feature_importances_
std = np.std([tree.feature_importances_ for tree in classifier.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
plt.title("Feature importances", fontsize = 20)
plt.bar(range(X_train.shape[1])[0:50], importances[indices][0:50],
yerr = std[indices][0:50], align = "center")
plt.xticks(rotation = 90)
plt.ylabel('Score', fontsize = 20)
plt.xticks(range(X_train.shape[1])[0:50], np.array(names)[indices][0:50])
plt.show()
We can see that the most common AT-rich K-mers turn out to be the most informative for the Random Forest classifier within the Bag of Words NLP model. Therefore, although those AT-rich K-mers are the most frequent in both Neanderthal-introgressed and depleted sequences, small differences between the two types of sequences allow the Random Forest to accurately classify each given sequence. Let us take the ANGPTL3 sequence and make a prediction for that sequence:
import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
with open('ANGPTL3_sequence.txt','r') as fin:
    ANGPTL3_seq = fin.read().upper()
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 8800
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    #my_intr_reverse_complement_seq = str(Seq(my_intr_seq).reverse_complement())
    #intr_seqs.append(my_intr_reverse_complement_seq)
    depl_seqs.append(my_depl_seq)
    #my_depl_reverse_complement_seq = str(Seq(my_depl_seq).reverse_complement())
    #depl_seqs.append(my_depl_reverse_complement_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
import os
import pickle
from sklearn.feature_extraction.text import CountVectorizer
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
classifier = pickle.load(open('RF_model_Neand_Intr_vs_Depl.sav', 'rb'))
cv = CountVectorizer()
merge_texts = intr_texts + depl_texts
merge_texts.append(' '.join(getKmers(ANGPTL3_seq[0:cutoff], kmer)))
X_ANGPTL3 = cv.fit_transform(merge_texts)
int(classifier.predict(X_ANGPTL3[-1,:].toarray()))
classifier.predict_proba(X_ANGPTL3[-1,:].toarray())
We can see that the Random Forest classifier gives a 66% probability that this gene was not inherited from Neanderthals; a predicted label of 0 means the ANGPTL3 sequence most likely falls within the regions of depleted Neanderthal ancestry. This is not a high probability, and we know from the confusion matrix that when we predict 0 the true label is actually 1 in almost 14% of cases, so the prediction for ANGPTL3 might not be accurate enough. Thus we cannot make a confident prediction for ANGPTL3, but perhaps there will be other genes that can be predicted to be inherited from Neanderthals. We can actually make a loop and use the trained Random Forest classifier model for making a prediction for each gene.
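One simple way to act on such borderline probabilities is to accept a label only when the predicted probability clears a confidence threshold. A minimal sketch of such a decision rule (the threshold of 0.8 and the example probabilities are my own choices for illustration):

```python
def confident_label(prob_intr, threshold=0.8):
    """Return 1 (introgressed) or 0 (depleted) only when the predicted
    probability clears the threshold, otherwise report the call as uncertain."""
    if prob_intr >= threshold:
        return 1
    if prob_intr <= 1 - threshold:
        return 0
    return None  # too uncertain to call

# e.g. with P(introgressed) = 0.34, as for ANGPTL3 above, the call is uncertain
print(confident_label(0.34))  # None
print(confident_label(0.95))  # 1
print(confident_label(0.05))  # 0
```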
Now we will use the trained Random Forest classifier in order to make predictions for all genes in the hg19 human reference genome. For this purpose we will have to 1) select sufficiently long genes with length > cut, and 2) merge the gene sentences together with the introgressed and depleted sentences in order to process them all through the CountVectorizer together, so that the sentences for prediction are prepared in an identical way to the data used for training the Random Forest classifier. As usual, we start by reading the introgressed and depleted sequences, splitting them into K-mers and building sentences / texts.
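The reason everything must go through the same CountVectorizer is that the classifier expects the K-mer columns in exactly the order seen during training. The general safe pattern is to fit the vectorizer once and then only call transform on new sentences; a toy sketch with made-up two-letter "K-mers":

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy two-letter 'K-mer' sentences standing in for the real data
train_sentences = ['aa ab ba bb', 'ab ab ba aa']
new_sentences = ['bb bb ab']

cv = CountVectorizer()
cv.fit(train_sentences)              # learn the vocabulary (column ordering) once
X_new = cv.transform(new_sentences)  # new sentences get the same column ordering

print(sorted(cv.vocabulary_))  # ['aa', 'ab', 'ba', 'bb']
print(X_new.toarray())         # [[0 1 0 2]]
```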
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 8800
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    depl_seqs.append(my_depl_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
print(len(intr_texts))
print(len(depl_texts))
Now we will do the same for the gene sequences, i.e. read them, split them into K-mers and build sentences / texts:
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/Genes/')
gene_file = 'hg19_gene_clean.fa'
a = 0
gene_seqs = []
gene_ids = []
for gene in SeqIO.parse(gene_file, 'fasta'):
    cut = 8800
    if len(str(gene.seq)) < cut:
        continue
    s_gene = str(gene.seq)[0:cut]
    if s_gene.count('A') > 0 and s_gene.count('C') > 0 and s_gene.count('G') > 0 and s_gene.count('T') > 0:
        gene_seqs.append(s_gene)
        gene_ids.append(str(gene.id))
    a = a + 1
    if a % 10000 == 0:
        print('Finished ' + str(a) + ' genes')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
gene_texts = [' '.join(getKmers(i, kmer)) for i in gene_seqs]
print(len(gene_texts))
print(len(gene_ids))
gene_ids[0:10]
import os
import pickle
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
with open('gene_ids.txt', 'w') as f:
    for item in gene_ids:
        f.write("%s\n" % item)
merge_texts = intr_texts + gene_texts
import os
import pickle
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
with open('merge_texts.txt', 'wb') as fp:
    pickle.dump(merge_texts, fp)
Here we need to restart the kernel in order to release memory, and then read the merge_texts list into memory again:
import os
import pickle
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
with open('merge_texts.txt', 'rb') as fp:
    merge_texts = pickle.load(fp)
import os
import pickle
from sklearn.feature_extraction.text import CountVectorizer
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
classifier = pickle.load(open('RF_model_Neand_Intr_vs_Depl.sav', 'rb'))
cv = CountVectorizer()
# Fit on the merged sentences so that the K-mer columns come out in the same
# (alphabetical) order as in the training data, then keep only the rows that
# correspond to the gene sentences (the last len(gene_ids) entries)
X_all = cv.fit_transform(merge_texts)
with open('gene_ids.txt') as f:
    n_genes = sum(1 for line in f)
X_gene = X_all[-n_genes:, :]
gene_predictions = classifier.predict(X_gene.toarray())
gene_predictions_prob = classifier.predict_proba(X_gene.toarray())
len(gene_predictions)
X_gene.shape
gene_predictions
import numpy as np
print(np.sum(gene_predictions==0))
print(np.sum(gene_predictions==1))
gene_predictions_prob
gene_predictions_prob_0 = [i[0] for i in gene_predictions_prob]
gene_predictions_prob_1 = [i[1] for i in gene_predictions_prob]
import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
gene_ids = []
gene_symbol = []
with open('gene_ids.txt','r') as fin:
    for line in fin:
        line = line.split('\t')
        gene_ids.append(line[0])
        gene_symbol.append(line[1].rstrip())
import pandas as pd
gene_pred_df = pd.DataFrame({'Gene': gene_ids, 'Gene_Symbol': gene_symbol, 'Predict': gene_predictions,
'Prob_0': gene_predictions_prob_0, 'Prob_1': gene_predictions_prob_1})
gene_pred_df = gene_pred_df.sort_values(['Prob_1'], ascending = False)
gene_pred_df.head(30)
gene_pred_df[gene_pred_df['Prob_1']>0.8].shape
gene_pred_df[gene_pred_df['Gene_Symbol']=="ANGPTL3"]
gene_pred_df.tail(30)
gene_pred_df.to_csv('Neanderthal_Genes.txt', header=True, index = False, sep = "\t")
Here we will still use the Bag of Words model, but implement a Multilayer Perceptron for classification of sequences coming from Neanderthal-introgressed vs. Neanderthal-depleted regions. We again start by reading the sequences from the two fasta-files (introgressed and depleted regions), splitting them into words / K-mers and building sentences out of them.
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 10000
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    #my_intr_reverse_complement_seq = str(Seq(my_intr_seq).reverse_complement())
    #intr_seqs.append(my_intr_reverse_complement_seq)
    depl_seqs.append(my_depl_seq)
    #my_depl_reverse_complement_seq = str(Seq(my_depl_seq).reverse_complement())
    #depl_seqs.append(my_depl_reverse_complement_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
As we did previously, we will use CountVectorizer in order to convert the words / K-mers into a numeric representation; the elements of the obtained matrix X show how often each word / K-mer occurs in each sentence / sequence.
merge_texts = intr_texts + depl_texts
len(merge_texts)
import numpy as np
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
print(len(labels))
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
cv = CountVectorizer()
X = cv.fit_transform(merge_texts)
#tfidf_transformer = TfidfTransformer()
#X = tfidf_transformer.fit_transform(X)
#tokenizer = Tokenizer()
#tokenizer.fit_on_texts(merge_texts)
#X = tokenizer.texts_to_matrix(merge_texts, mode = 'freq')
#encoded_docs = tokenizer.texts_to_sequences(merge_texts)
#max_length = max([len(s.split()) for s in merge_texts])
#X = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post')
X = np.int32(X.toarray())
print(X)
print('\n')
print(X.shape)
Again, we split the data set into training, X_train, and testing, X_test, subsets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size = 0.20, random_state = 42)
print(X_train.shape)
print(X_test.shape)
Now the data is prepared to be fed into the Multilayer Perceptron model for classification of sequences coming from Neanderthal-introgressed vs. Neanderthal-depleted regions.
from keras.models import Sequential
from keras.regularizers import l2, l1
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
model = Sequential()
model.add(Dense(3000, input_shape=(X.shape[1],), activation = 'sigmoid',
kernel_regularizer = l1(0.00001)))
#model.add(Dense(1000, activation = 'sigmoid'))
#model.add(Dense(100, activation = 'sigmoid'))
#model.add(Dense(10, activation = 'sigmoid'))
#model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
epochs = 200
lrate = 0.0001
#decay = lrate / epochs
#sgd = SGD(lr = lrate, momentum = 0.99, nesterov = True)
sgd = SGD(lr = lrate, momentum = 0.9, nesterov = False)
#sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
#model.compile(loss = 'binary_crossentropy', optimizer = Adam(lr = lrate), metrics = ['binary_accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['binary_accuracy'])
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_binary_accuracy', verbose = 1,
save_best_only = True, mode = 'max')
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(X_train, y_train,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
callbacks = [checkpoint])
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
model.load_weights("weights.best.hdf5")
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(X_test)
cm = confusion_matrix(y_test, [np.round(i[0]) for i in predicted_labels])
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(X_test, y_test, verbose = 0)
print("Accuracy: %.2f%%" % (scores[1]*100))
We reached quite a good classification accuracy, though still worse than the Random Forest model on the same Bag of Words (K-mer frequencies) data set. Let us see whether more advanced word-embedding models can improve the accuracy of classifying Neanderthal-introgressed vs. depleted sequences / sentences / texts.
Here, instead of the Bag of Words model, we will implement a Multilayer Perceptron model with an Embedding layer which is supposed to learn word embeddings, i.e. map words with similar meaning to similar numeric vectors. This is a model where the order of the words is taken into account for classification of sequences coming from Neanderthal-introgressed vs. Neanderthal-depleted regions. We again start by reading the sequences from the two fasta-files (introgressed and depleted regions), splitting them into words / K-mers and building sentences out of them.
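Geometrically, "similar meaning" means that after training, the embedding vectors (rows of the Embedding layer's weight matrix) of related words lie close together, e.g. in cosine similarity. A numpy-only sketch with a hypothetical 3-dimensional embedding for three K-mers (the vectors are invented for illustration):

```python
import numpy as np

# Hypothetical learned embedding matrix: one 3-dimensional vector per K-mer
embeddings = {
    'TTTTT': np.array([0.9, 0.1, 0.0]),
    'TTTTA': np.array([0.8, 0.2, 0.1]),
    'GCGCG': np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# The two AT-rich K-mers are much closer to each other than to the GC-rich one
print(cosine(embeddings['TTTTT'], embeddings['TTTTA']))
print(cosine(embeddings['TTTTT'], embeddings['GCGCG']))
```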
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 4000
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    depl_seqs.append(my_depl_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
Here we will use the Tokenizer class from Keras in order to convert the words / K-mers into integers.
merge_texts = intr_texts + depl_texts
len(merge_texts)
import numpy as np
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
print(len(labels))
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
#cv = CountVectorizer()
#X = cv.fit_transform(merge_texts)
#tfidf_transformer = TfidfTransformer()
#X = tfidf_transformer.fit_transform(X)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(merge_texts)
#X = tokenizer.texts_to_matrix(merge_texts, mode = 'freq')
encoded_docs = tokenizer.texts_to_sequences(merge_texts)
max_length = max([len(s.split()) for s in merge_texts])
X = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post')
#X = X.toarray()
print(X)
print('\n')
print(X.shape)
Again, we split the data set into training, X_train, and testing, X_test, subsets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size = 0.20, random_state = 42)
print(X_train.shape)
print(y_train[0:10])
print(X_test.shape)
print(y_test[0:10])
print(max_length)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
Now the data is prepared to be fed into the Multilayer Perceptron model for classification of sequences coming from Neanderthal-introgressed vs. Neanderthal-depleted regions.
from keras.models import Sequential
from keras.regularizers import l2, l1
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, GlobalAveragePooling1D
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length = max_length)) #W_regularizer = l1(0.000001)
#model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', activation = 'relu'))
#model.add(MaxPooling1D(pool_size = 2))
#model.add(GlobalAveragePooling1D())
model.add(Flatten())
model.add(Dense(10, activation = 'sigmoid')) #kernel_regularizer = l1(0.00004)
#model.add(Dropout(0.5))
#model.add(Dense(10, activation = 'sigmoid'))
#model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
epochs = 20
#lrate = 0.0001
#decay = lrate / epochs
#sgd = SGD(lr = lrate, momentum = 0.9, nesterov = False)
#sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
#model.compile(loss = 'binary_crossentropy', optimizer = Adam(lr = lrate), metrics = ['accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['accuracy'])
model.compile(loss = 'binary_crossentropy', optimizer = 'SGD', metrics = ['accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_acc', verbose = 1,
save_best_only = True, mode = 'max')
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(X_train, y_train,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
callbacks = [checkpoint])
model.load_weights("weights.best.hdf5")
model.compile(loss = 'binary_crossentropy', optimizer = 'SGD', metrics = ['accuracy'])
import matplotlib.pyplot as plt
plt.figure(figsize = (20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
We can see that the model clearly overfits but still reaches quite high accuracy on the validation data set. Some regularization with Dropout or L1 / L2 would be good to test here; there is potential for improvement.
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(X_test)
cm = confusion_matrix(y_test, [np.round(i[0]) for i in predicted_labels])
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
We conclude again that the Embedding Layer did not improve the classification accuracy compared to the Bag of Words model. It could not outperform the Random Forest model either.
Apparently AutoKeras in its current form is extremely memory-hungry. I did not manage to fit even the moderately sized IMDB data set into my memory. So while AutoML is in theory a very cool concept, it is currently too immature to be used for real-world applications.
import numpy as np
import tensorflow as tf
import autokeras as ak
index_offset = 3 # word index offset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words = 20,maxlen = 180,
index_from = index_offset)
x_train[0][0:10]
x_test[0][0:10]
x_train = x_train
y_train = y_train.reshape(-1, 1)
x_test = x_test
y_test = y_test.reshape(-1, 1)
y_train.shape
word_to_id = tf.keras.datasets.imdb.get_word_index()
word_to_id = {k: (v + index_offset) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value: key for key, value in word_to_id.items()}
x_train = list(map(lambda sentence: ' '.join(id_to_word[i] for i in sentence), x_train))
x_test = list(map(lambda sentence: ' '.join(id_to_word[i] for i in sentence), x_test))
x_train = np.array(x_train, dtype=str)
x_test = np.array(x_test, dtype=str)
x_train.shape
x_train[0]
y_train.shape
y_train
x_test[0]
clf = ak.TextClassifier(verbose=True)
clf.fit(x_train, y_train, time_limit = 60)
#clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
clf.evaluate(x_test, y_test)
Again, the same problem. I could not run the TextClassifier either on the benchmark IMDB data or on my Neanderthal introgressed vs. depleted texts. We need to wait for better releases of TextClassifier. Still, there is no guarantee that AutoML will beat my manual tuning of the neural networks, judging by the analysis from this blog https://www.pyimagesearch.com/2019/01/07/auto-keras-and-automl-a-getting-started-guide/.
import autokeras as ak
from autokeras import TextClassifier
classifier = TextClassifier(verbose = True)
classifier.fit(X_train, y_train, time_limit = 60 * 60)
classifier.final_fit(X_train, y_train, X_test, y_test, retrain = True)
y_train = [int(i) for i in y_train]
scores = classifier.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Bag of Words is a good model that compares K-mer frequencies between Neanderthal-introgressed vs. depleted regions, and it achieves quite a high accuracy of sequence classification. However, it does not take connections between the words / K-mers into account. We can do a more advanced classification using Convolutional Neural Networks (CNNs) with a special Embedding layer in order to incorporate more memory into our model, so that the words / K-mers remember their order in the sentence / sequence. The Embedding layer provides a way to learn word embeddings / numeric representations while performing the classification task. We again start by reading the sequences from the two fasta-files (introgressed and depleted regions), splitting them into words / K-mers and building sentences out of them.
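The intuition behind Conv1D on embedded sequences can be sketched in pure numpy: a filter of width 3 slides along the sequence of embedding vectors and responds to local patterns of consecutive words / K-mers (all numbers here are invented for illustration):

```python
import numpy as np

# Sequence of 5 embedded tokens, embedding dimension 2
seq = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

# One convolutional filter of width 3 over the embedding dimension
filt = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 0.0]])

# Valid 1D convolution: dot product of the filter with each window of 3 tokens;
# the response is high wherever the local pattern matches the filter
feature_map = np.array([np.sum(seq[i:i + 3] * filt) for i in range(len(seq) - 3 + 1)])
print(feature_map)  # [3. 0. 3.]
```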
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 4000
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    intr_seqs.append(my_intr_seq)
    depl_seqs.append(my_depl_seq)
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
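As a quick sanity check, getKmers slides a window of the chosen size along the sequence one nucleotide at a time, so each K-mer overlaps the previous one; a toy example:

```python
def getKmers(sequence, size):
    # Overlapping K-mers: one window per starting position, upper-cased
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]

# An 8 bp toy sequence yields 8 - 5 + 1 = 4 overlapping 5-mers
print(getKmers('atgcgtac', 5))  # ['ATGCG', 'TGCGT', 'GCGTA', 'CGTAC']
```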
kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
merge_texts = intr_texts + depl_texts
len(merge_texts)
import numpy as np
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
print(len(labels))
Now we are going to convert the Neanderthal introgressed and depleted texts into integers to be fed into the Keras Embedding layer. After this step every unique word will be represented by an integer. For this we use the fit_on_texts function from the Tokenizer class of Keras. The Keras Embedding layer requires all individual documents to be of the same length, hence we will pad the shorter documents with 0. Therefore, in the Keras Embedding layer the input_length will be equal to the length (i.e. number of words) of the longest document. To pad the shorter documents I am using the pad_sequences function from the Keras library.
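To make the integer encoding and post-padding concrete before running the full Keras pipeline, here is a minimal pure-Python sketch of the same idea on two hypothetical toy documents (index 0 is reserved for padding, as in the Keras Tokenizer):

```python
docs = ['ATGCG TGCGT GCGTA', 'ATGCG GCGTA']

# Build a word -> integer index; 0 is reserved for the padding token
vocab = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

# Encode each document as a sequence of integers
encoded = [[vocab[w] for w in doc.split()] for doc in docs]
max_length = max(len(seq) for seq in encoded)

# Post-pad shorter documents with zeros, mimicking pad_sequences(..., padding = 'post')
padded = [seq + [0] * (max_length - len(seq)) for seq in encoded]
print(padded)  # [[1, 2, 3], [1, 3, 0]]
```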
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
#cv = CountVectorizer()
#X = cv.fit_transform(merge_texts)
#tfidf_transformer = TfidfTransformer()
#X = tfidf_transformer.fit_transform(X)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(merge_texts)
#X = tokenizer.texts_to_matrix(merge_texts, mode = 'freq')
encoded_docs = tokenizer.texts_to_sequences(merge_texts)
max_length = max([len(s.split()) for s in merge_texts])
X = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post')
#X = np.int32(X.toarray())
print(X)
print('\n')
print(X.shape)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size = 0.20, random_state = 42)
print('Train input data dimensions:')
print(X_train.shape)
print(len(y_train))
print('\n')
print('Test input data dimensions:')
print(X_test.shape)
print(len(y_test))
print("The maximum number of words in any document: ", max_length)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size, i.e. number of unique words in the text: ', vocab_size)
Now all the documents are of the same length (after padding), so we are ready to create and use the embeddings. Here I will design a Convolutional Neural Network (CNN) with an Embedding layer that embeds the words into vectors of 10 dimensions.
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length = max_length))
#model.add(Dropout(0.5))
model.add(Conv1D(filters = 16, kernel_size = 5, activation = 'relu'))
model.add(MaxPooling1D(pool_size = 2))
model.add(Flatten())
#model.add(Dense(8, activation = 'relu'))
#model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
epochs = 50
lrate = 0.0001
decay = lrate / epochs
#sgd = SGD(lr = lrate, momentum = 0.99, nesterov = True)
sgd = SGD(lr = lrate, momentum = 0.9, nesterov = False)
#sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['binary_accuracy'])
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_binary_accuracy', verbose = 1,
                             save_best_only = True, mode = 'max')
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(X_train, y_train,
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
                    callbacks = [checkpoint])
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(X_test)
cm = confusion_matrix(y_test, [np.round(i[0]) for i in predicted_labels])
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color = 'white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
As with the CNN above, we can go beyond Bag of Words K-mer frequencies and model the order of the words / K-mers, this time using an LSTM with an Embedding layer. Recurrent layers such as the LSTM carry memory along the sequence, so they can potentially capture longer-range dependencies than the local convolutional filters. The Embedding layer again provides a way to learn word embeddings / numeric representations while performing the classification task. We start as before by reading the sequences from the two fasta files (introgressed and depleted regions), splitting them into words / K-mers and building sentences out of them.
import os
from Bio import SeqIO
from Bio.Seq import Seq
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/')
intr_file = 'hg19_intr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    #cutoff = 200
    #my_intr_seq = str(intr.seq)[0:cutoff]
    #my_depl_seq = str(depl.seq)[0:cutoff]
    #intr_seqs.append(my_intr_seq)
    #depl_seqs.append(my_depl_seq)
    step = 200; jump = 1; a = 0; b = step; n_jumps = 5
    for j in range(n_jumps):
        s_intr = str(intr.seq)[a:b]
        s_depl = str(depl.seq)[a:b]
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
        a = a + jump
        b = a + step
    e = e + 1
    if e % 20000 == 0:
        print('Finished ' + str(e) + ' entries')
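The windowing above cuts n_jumps overlapping windows of length step out of each region, each shifted by jump nucleotides relative to the previous one. A minimal sketch of the same logic on a toy string, with hypothetical small parameter values so the output is easy to inspect:

```python
def sliding_windows(sequence, step, jump, n_jumps):
    # Extract n_jumps windows of length step, each shifted by jump nucleotides
    windows = []
    a, b = 0, step
    for _ in range(n_jumps):
        windows.append(sequence[a:b])
        a = a + jump
        b = a + step
    return windows

print(sliding_windows('ATGCGTACGT', step = 4, jump = 1, n_jumps = 3))
# ['ATGC', 'TGCG', 'GCGT']
```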
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
kmer = 10
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]
merge_texts = intr_texts + depl_texts
len(merge_texts)
import numpy as np
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
print(len(labels))
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.filterwarnings('ignore')
#cv = CountVectorizer()
#X = cv.fit_transform(merge_texts)
#tfidf_transformer = TfidfTransformer()
#X = tfidf_transformer.fit_transform(X)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(merge_texts)
#X = tokenizer.texts_to_matrix(merge_texts, mode = 'freq')
encoded_docs = tokenizer.texts_to_sequences(merge_texts)
max_length = max([len(s.split()) for s in merge_texts])
X = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post')
print(X)
print('\n')
print(X.shape)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size = 0.20, random_state = 42)
print(X_train.shape)
print(X_test.shape)
max_length = max([len(s.split()) for s in merge_texts])
print(max_length)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta, RMSprop
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout
from keras.layers import Embedding, GlobalAveragePooling1D, LSTM, SimpleRNN, GRU, Bidirectional
model = Sequential()
model.add(Embedding(vocab_size, 10)) #dropout = 0.2 #input_length = max_length
#model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', activation = 'relu'))
#model.add(MaxPooling1D(pool_size = 2))
#model.add(LSTM(10)) #dropout = 0.2, recurrent_dropout = 0.2
model.add(Bidirectional(LSTM(10))) #dropout = 0.2, recurrent_dropout = 0.2
#model.add(Bidirectional(SimpleRNN(10)))
#model.add(GRU(10))
#model.add(SimpleRNN(10, dropout = 0.2, recurrent_dropout = 0.2))
#model.add(Flatten())
model.add(Dense(10, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
epochs = 5
model.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = 'SGD', metrics = ['accuracy'])
#model.compile(loss = 'binary_crossentropy', optimizer = RMSprop(lr = 0.0001), metrics = ['accuracy'])
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor = 'val_acc', verbose = 1,
                             save_best_only = True, mode = 'max')
print(model.summary())
import warnings
warnings.filterwarnings('ignore')
history = model.fit(X_train, y_train,
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True,
                    callbacks = [checkpoint])
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
model.load_weights("weights.best.hdf5")
model.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['accuracy'])
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import itertools
plt.figure(figsize = (15,10))
predicted_labels = model.predict(X_test)
cm = confusion_matrix(y_test, [np.round(i[0]) for i in predicted_labels])
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color = 'white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(X_test, y_test, verbose = 0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Now let us visualize the word embeddings from the Embedding layer.
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
words = [i.upper() for i in list(tokenizer.index_word.values())]
words[0:5]
import os
import io
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr')
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for num, word in enumerate(words):
    vec = weights[num + 1] # skip index 0, it is the padding token
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
out_v.close()
out_m.close()
For the visualization we will use the Embedding Projector from http://projector.tensorflow.org/, following the tutorial https://www.tensorflow.org/tutorials/text/word_embeddings:
from IPython.display import Image
path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/NeandIntr/'
Image(path + 'WordEmbeddings.gif.png', width=2000)